
feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip#210

Merged
louisliu2048 merged 12 commits into main from niven/flashblocks-no-reorg on Mar 27, 2026
Conversation

@sieniven
Contributor

@sieniven sieniven commented Mar 20, 2026

Summary

This PR ensures zero re-orgs on subscribed flashblocks RPC nodes by making p2p gossip to follower sequencers a blocking operation that must succeed before the flashblocks are gossiped to RPC nodes. This makes flashblocks replay atomic across sequencer switches or leader-sequencer failures: consistent gossiped flashblocks are replayed across RPC nodes, guaranteeing that RPC nodes are protected from flashblocks re-orgs. Note that the websocket publishing logic moves into the new broadcast submodule of the xlayer-builder crate.

In addition, on p2p node:

  1. On broadcast failures, the logic now immediately re-establishes a stream if the TCP connection is still alive
  2. The retry interval is reduced to 1s

Reorg considerations

If a p2p broadcast fails at the transport layer, the blocking broadcast-to-peers operation provides guarantees only at the application level, so a re-org remains possible in that case.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sieniven
Contributor Author

@claude review this PR

@github-actions
Contributor

github-actions bot commented Mar 22, 2026

Claude finished @sieniven's task in 3m 31s


PR Review: feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip

  • Gather context and understand the PR
  • Read changed files
  • Analyze logic correctness and atomicity guarantees
  • Check for bugs, edge cases, and error handling
  • Provide review feedback

Summary

Good architectural direction — removing ws_pub from the builder and consolidating publish responsibility into the p2p Node simplifies the builder and keeps WS publishing coordinated with p2p gossip. The Message trait → concrete Message enum refactor also reduces unnecessary generics. However, there are several issues worth addressing:


🔴 Critical: WS publishes before p2p gossip in Node::run() — opposite of the stated guarantee

crates/builder/src/p2p/mod.rs, lines ~157–162:

if let Message::OpFlashblockPayload(ref fb_payload) = message {
    let flashblock_byte_size = ws_pub.publish(fb_payload)?;   // ← WS first
    metrics.flashblock_byte_size_histogram.record(flashblock_byte_size as f64);
}
if let Err(e) = outgoing_streams_handler.broadcast_message(message).await {  // ← p2p second
    warn!(target: "flashblocks-p2p", "...");
}

The PR description states: "p2p gossip to follower sequencers is a blocking successful operation first before gossiping the flashblocks to RPC nodes."

But the code publishes to WebSocket subscribers before broadcast_message() to peers. This means the leader's own WS subscribers (RPC nodes) receive the flashblock before follower sequencers do, which is the exact re-org risk the PR claims to fix. The broadcast_message() call should come first, and ws_pub.publish() only after it returns (or at least returns without error).



🟡 Medium: ws_pub.publish()? kills the p2p node if broadcast channel has no receivers

crates/builder/src/p2p/mod.rs:

let flashblock_byte_size = ws_pub.publish(fb_payload)?;

WebSocketPublisher::publish() calls self.pipe.send(utf8_bytes) (a broadcast::Sender), which returns Err when there are no active receivers. The initial receiver created in WebSocketPublisher::new() is held by the listener_loop task, so in normal operation this is unlikely to fire. However, if the listener_loop task ever panics or exits unexpectedly, subsequent calls to ws_pub.publish() will kill the p2p node entirely via ?.

The previous code in builder.rs handled this with map_err(PayloadBuilderError::other)?, which only terminated the current build job (restartable), not the p2p node service. The handler's follower path (in handler.rs) correctly uses if let Err(e) = ws_pub.publish(...) { warn!(...) }. The Node path should follow the same pattern.
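The failure mode is easy to reproduce in miniature. A minimal sketch, using std::sync::mpsc as a stand-in for tokio's broadcast channel (`publish` and the channel setup here are illustrative, not the crate's API):

```rust
use std::sync::mpsc;

// Hypothetical stand-in for WebSocketPublisher::publish. A std mpsc
// sender whose receiver has been dropped fails on send, analogous to
// tokio's broadcast::Sender::send returning Err with zero receivers.
fn publish(tx: &mpsc::Sender<String>, payload: &str) -> Result<usize, String> {
    tx.send(payload.to_string()).map_err(|e| e.to_string())?;
    Ok(payload.len())
}

fn main() {
    let (tx, rx) = mpsc::channel::<String>();
    drop(rx); // simulate the listener_loop task exiting

    // Propagating the error with `?` would kill the whole service;
    // logging keeps the node alive, as the follower path already does.
    match publish(&tx, "flashblock") {
        Ok(n) => println!("published {n} bytes"),
        Err(e) => eprintln!("warn: ws publish failed, continuing: {e}"),
    }
}
```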



🟡 Medium: flashblock_byte_size_histogram metric silently drops when p2p is disabled

crates/builder/src/flashblocks/builder.rs (removed) vs crates/builder/src/p2p/mod.rs (added):

Previously, ctx.metrics.flashblock_byte_size_histogram.record(...) was called directly in the builder — always, regardless of p2p state. Now it's only recorded inside Node::run(), which only runs when p2p_enabled = true. Deployments with p2p_enabled = false will silently lose this metric after this change.


🟢 Good: behavior fix for no_tx_pool path

crates/builder/src/flashblocks/builder.rs:

Previously, built_fb_payload_tx.try_send(fb_payload) for the initial flashblock was guarded only by !rebuild_external_payload, meaning syncing FCUs (where no_tx_pool=true) would still trigger p2p gossip. The new code correctly gates it with !no_tx_pool && !rebuild_external_payload, consistent with the comment "not emitting flashblock if no_tx_pool in FCU, it's just syncing". This is a correct fix.


🟢 Good: Message trait → enum refactor

Removing the Message generic trait in favor of a concrete Message enum in types.rs is the right call for an internal protocol — reduces boilerplate, removes the generic parameter from Node<M>, and makes deserialization straightforward (serde_json::from_str::<Message>). The types.rs module is well-organized.


Minor: Default removed for NodeBuilder

Since NodeBuilder::new() now requires ws_pub and metrics, removing Default is correct. Worth verifying no external call sites use NodeBuilder::default() (doesn't appear to be any in the diff, but worth a quick grep if not already done).


Test coverage note

The test in p2p/mod.rs now uses OpFlashblockPayload::default() as the round-trip message and binds WebSocketPublisher to 127.0.0.1:0. Consider adding a test case that verifies the WS publish is called (or not) at the right point relative to p2p gossip, once the ordering fix is applied.


TL;DR: The main concern is the WS-before-p2p ordering in Node::run(), which inverts the core atomicity guarantee the PR is trying to establish. Suggest swapping ws_pub.publish() to run after broadcast_message(), and converting the ? to a warn! to avoid killing the p2p node on publish failure.
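The suggested swap can be sketched as follows — a minimal model with hypothetical `broadcast_to_peers`/`relay` helpers and a slice of bools standing in for peer streams, not the crate's real types:

```rust
// Return indices of peers whose stream write "failed" (illustrative).
fn broadcast_to_peers(peers: &[bool], _payload: &str) -> Vec<usize> {
    peers
        .iter()
        .enumerate()
        .filter(|(_, ok)| !**ok)
        .map(|(i, _)| i)
        .collect()
}

fn relay(peers: &[bool], ws_subscribers: &mut Vec<String>, payload: &str) {
    let failed = broadcast_to_peers(peers, payload); // p2p gossip first
    if !failed.is_empty() {
        // publishing now could reorg RPC nodes on a sequencer switch
        eprintln!("warn: {} peer(s) missed the flashblock", failed.len());
        return;
    }
    ws_subscribers.push(payload.to_string()); // WS publish second
}

fn main() {
    let mut subs = Vec::new();
    relay(&[true, true], &mut subs, "fb1"); // all peers ok → published
    relay(&[true, false], &mut subs, "fb2"); // one peer failed → suppressed
    assert_eq!(subs, vec!["fb1".to_string()]);
}
```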

sieniven and others added 2 commits March 23, 2026 15:22
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… flag

Rename p2p/ module to broadcast/ to better reflect its dual role (p2p
gossip + WS publishing). Move wspub.rs into broadcast/, remove the
p2p_enabled config flag so the broadcast node always runs, fix WS
publish ordering to run after p2p gossip, and improve error handling
so ws_pub failures warn instead of crashing the node. Update stale
comments referencing the removed p2p_enabled flag.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@sieniven
Contributor Author

@claude please help to review again, check that all your previous mentioned points have been resolved

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

Claude finished @sieniven's task in 2m 13s


PR Re-Review: feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip

  • Gather context and read changed files
  • Check: WS publish ordering vs p2p gossip (🔴 Critical)
  • Check: ws_pub.publish()? error propagation (🟡 Medium)
  • Check: flashblock_byte_size_histogram metric with p2p disabled (🟡 Medium)
  • Check: Minor notes (NodeBuilder Default, test coverage)
  • Provide updated review feedback

All three previously raised issues have been resolved. Details below.

✅ Fixed: WS publish ordering (was 🔴 Critical)

crates/builder/src/broadcast/mod.rs, lines 169–181:

if let Err(e) = outgoing_streams_handler.broadcast_message(message.clone()).await {
    warn!(...)  // p2p first
}
if let Message::OpFlashblockPayload(ref fb_payload) = message {
    match ws_pub.publish(fb_payload) { ... }  // WS second
}

broadcast_message (p2p gossip) now runs before ws_pub.publish (WS). The ordering now matches the stated atomicity guarantee. ✅

✅ Fixed: ws_pub.publish()? error propagation (was 🟡 Medium)

The ? is replaced with a match that logs a warn! on failure — consistent with the follower path in handler.rs. The p2p node will no longer terminate if the WS broadcast channel has no receivers. ✅

✅ Fixed: Metric coverage when p2p disabled (was 🟡 Medium)

The p2p_enabled flag has been removed from args/op.rs. The broadcast node is now always spawned unconditionally in service.rs, so flashblock_byte_size_histogram is always recorded via Node::run(). ✅

✅ Resolved: NodeBuilder::default() call sites

No usages of NodeBuilder::default() exist in the codebase. service.rs correctly uses NodeBuilder::new(ws_pub, metrics). ✅

Data flow looks correct end-to-end

  • Leader path: builder → built_fb_payload_tx → handler (built_fb_payload_rx) → p2p_tx → Node::run() (outgoing_message_rx) → p2p gossip first → WS publish second
  • Follower path: Node::run() incoming stream → handler (p2p_rx) → cache update → WS publish

No double-publishing between the two paths (outgoing vs. incoming are separate channels). ✅

Open (non-blocking): Test coverage

The test in broadcast/mod.rs validates node connectivity and message round-tripping. A dedicated test asserting WS publish is called after p2p gossip would make the ordering guarantee explicit and regression-proof, but this is not a blocker.


Summary: All three critical/medium issues from the previous review are fully addressed. The PR achieves its stated goal — p2p gossip to follower sequencers runs as a blocking operation before WS publish to RPC subscribers. LGTM.

sieniven and others added 5 commits March 25, 2026 12:47
…ntics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When `broadcast_message` fails to write to a peer's stream, the stream
was silently dropped but the peer entry remained in `StreamsHandler`,
causing every subsequent broadcast to emit a `no stream for protocol`
warning indefinitely — confirmed in production logs.

Two-part fix:
- `broadcast_message` now carries the `PeerId` through the error path,
  evicts the peer from the map on failure, and returns the list of
  failed peers to the caller instead of swallowing the error.
- `Node::run` immediately attempts `open_stream` for each failed peer
  after every broadcast, recovering the yamux stream on the existing
  TCP connection without waiting for the 60 s retry tick.

Also adds two unit tests in `mod.rs` that drive real libp2p swarms:
`broadcast_evicts_peer_on_stream_failure` and
`broadcast_returns_empty_failed_peers_on_success`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
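The two-part fix described in the commit above can be modeled in isolation. A minimal sketch, with a `HashMap<PeerId, bool>` standing in for the real stream map (names and types here are illustrative, not the crate's API):

```rust
use std::collections::HashMap;

type PeerId = u64;

// Hypothetical sketch of the eviction logic: a failed stream write
// evicts the peer from the map, and the failed ids are returned so the
// caller can immediately attempt open_stream on the live connection.
// `bool` stands in for a writable yamux stream.
fn broadcast_message(streams: &mut HashMap<PeerId, bool>, _msg: &str) -> Vec<PeerId> {
    let mut failed: Vec<PeerId> = streams
        .iter()
        .filter(|(_, writable)| !**writable)
        .map(|(&id, _)| id)
        .collect();
    failed.sort_unstable();
    for id in &failed {
        // evict: stop warning "no stream for protocol" on every broadcast
        streams.remove(id);
    }
    failed
}

fn main() {
    let mut streams = HashMap::from([(1u64, true), (2u64, false)]);
    let failed = broadcast_message(&mut streams, "flashblock");
    assert_eq!(failed, vec![2]);
    assert!(!streams.contains_key(&2)); // evicted
    assert!(streams.contains_key(&1)); // healthy peer untouched
}
```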
Speed up P2P peer reconnection by lowering the default retry interval,
improving flashblocks broadcast recovery after transient disconnects.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace test-only StreamProtocol constants with the actual
FLASHBLOCKS_STREAM_PROTOCOL to ensure tests exercise the same protocol
negotiation as production code.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Namespace all WebSocketPublisher log messages under
payload_builder::broadcast for easier filtering and log segregation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sieniven
Contributor Author

@claude review the PR again

@github-actions
Contributor

github-actions bot commented Mar 26, 2026

Claude finished @sieniven's task in 2m 35s


PR Review: feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip

  • Gather context from previous reviews
  • Read current state of changed files
  • Analyze for any remaining or new issues
  • Provide review feedback

All three issues from the first review were confirmed fixed in the second review. This third pass verifies nothing has regressed and checks the latest commits (a338190, 4fcc2fc, 8236850, 8851252, 4d02bba).

✅ All previous issues remain fixed

WS ordering (broadcast/mod.rs:185–224): broadcast_message (p2p) runs first and fully awaits, then ws_pub.publish runs only after all peer writes have resolved. ✅

Error propagation: ws_pub.publish errors are handled with match ... Err(e) => warn!(...) — p2p node will not terminate on WS send failure. ✅

Metric coverage: p2p_enabled flag is gone from args/op.rs; broadcast node is always spawned; flashblock_byte_size_histogram is always recorded. ✅


New findings from latest commits

4d02bba — design comments: The added comment block at broadcast/mod.rs:170–184 is excellent. It clearly documents the "TCP kernel buffer" semantics and the intentional trade-off of not requiring ACKs from peers before WS publish. This makes the design intent explicit and reviewable.

8851252 — stream recovery on send failure: After broadcast_message returns failed_peers, the node immediately tries to re-open a new yamux stream to each failed peer (mod.rs:191–209). This is a nice improvement — it recovers from application-level stream closes without waiting for a full TCP reconnect cycle.

8236850 — peer retry interval reduced to 1s: Reasonable for a HA sequencer setup where failover latency matters.

4fcc2fc — real stream protocol in tests: Tests now use types::FLASHBLOCKS_STREAM_PROTOCOL instead of a fabricated protocol string, making them more faithful to production behavior.


One observation (non-blocking)

handler.rs:126 — p2p_tx.send(...).await creates backpressure

Some(payload) = built_fb_payload_rx.recv() => {
    let _ = p2p_tx.send(Message::from_flashblock_payload(payload)).await;
}

p2p_tx.send is an await-able call. If the broadcast node's outgoing_message_rx buffer fills up (capacity 100, mod.rs:448), this will block the handler's select loop, which would prevent built_payload_rx and p2p_rx from being polled. Eventually this causes built_fb_payload_tx.try_send() in builder.rs to return Err(Full) and flashblocks to be dropped. In practice the 100-message buffer should be far more than enough, so this is not likely to be a problem — just worth being aware of.
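The try_send failure mode is straightforward to demonstrate with a std bounded channel standing in for the tokio mpsc buffer (capacity shrunk to 2 for illustration; the channel names are hypothetical):

```rust
use std::sync::mpsc;

fn main() {
    // A capacity-2 bounded channel stands in for the 100-slot outgoing
    // message buffer. try_send fails fast when the buffer is full,
    // whereas a blocking send() would stall the caller's select loop.
    let (tx, _rx) = mpsc::sync_channel::<&str>(2);
    tx.try_send("fb1").unwrap();
    tx.try_send("fb2").unwrap();
    match tx.try_send("fb3") {
        Err(mpsc::TrySendError::Full(_)) => {
            eprintln!("warn: buffer full, flashblock dropped")
        }
        other => println!("unexpected: {other:?}"),
    }
}
```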


Data flow correctness (end-to-end)

| Path | Flow | WS publish | Correct? |
| --- | --- | --- | --- |
| Leader | builder → built_fb_payload_tx → handler → p2p_tx → Node::run() → p2p gossip → WS publish | after p2p | ✅ |
| Follower | remote peer → IncomingStreamsHandler → handler (p2p_rx) | cache + WS publish direct | ✅ |

No double-publish between paths (outgoing vs. incoming channels are distinct). ✅


LGTM. The PR achieves its stated goal cleanly. The design trade-off (no delivery ACK from peers, TCP kernel-buffer as the "happens-before" boundary) is clearly documented in comments and is a reasonable choice for low-latency flashblock streaming.

sieniven and others added 4 commits March 26, 2026 13:37
If `broadcast_message` returns `Err` (message serialization failure),
skip publishing the flashblock to WebSocket subscribers via `continue`.
Previously the ws publish ran unconditionally after the match block.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Upholds the reorg-safety invariant from PR #210: WebSocket publish to
RPC subscribers only happens if all follower sequencers received the
flashblock via P2P. If any peer's stream send failed, that peer is a
valid leader candidate without the flashblock — publishing to WS would
cause a reorg on sequencer switch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the pre-loop `has_failed_peers` flag with iteration by
reference (`for &peer_id in &failed_peers`), keeping the vec accessible
after the loop so emptiness can be checked directly. PeerId is Copy so
the destructuring pattern is idiomatic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sieniven
Contributor Author

@claude review again

@github-actions
Contributor

github-actions bot commented Mar 26, 2026

Claude finished @sieniven's task in 1m 48s


PR Re-Review (4th pass): feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip

  • Gather context from previous reviews
  • Read current state of changed files (latest commits: 9d98db6, c498f71, b07040a, 7d7f01b, a338190)
  • Analyze for any new issues or regressions
  • Provide review feedback

All previously raised issues remain fixed. This pass reviews the 5 newest commits added since the last review.

✅ All previous fixes still hold

WS ordering (broadcast/mod.rs:185): broadcast_message (p2p) still runs and fully resolves before any ws_pub.publish call. ✅

Error propagation: ws_pub.publish errors still handled with match ... Err(e) => warn!(...). ✅

Metric coverage: broadcast node always spawned, flashblock_byte_size_histogram always recorded. ✅


New commits analysis

b07040a — skip ws publish if any peer failed (mod.rs:210–212)

if !failed_peers.is_empty() {
    continue;  // skip WS publish entirely
}

This tightens the atomicity guarantee: WS publish only fires when ALL peers successfully received the flashblock, at the cost of dropping at most one flashblock to WS subscribers per stream failure. The trade-off is intentional and consistent with the PR's zero-reorg goal. Worth noting: by the time the continue runs, the stream recovery loop has already attempted to re-open streams (mod.rs:191–209), so the next flashblock should be delivered normally. Dropping one flashblock per failure is acceptable.

7d7f01b — skip ws publish on serialization failure (mod.rs:214–217)

Err(e) => {
    warn!(...);
    continue;  // skip WS publish on serialize error
}

Consistent with the "p2p-first" design — serialization failures now also suppress WS publish. Correct.

c498f71 — iterate failed peers by ref (mod.rs:191)

for &peer_id in &failed_peers {

Clean refactor. Avoids consuming failed_peers early so the is_empty() check at line 210 still has access to the vec. Correct.

a338190 — log target rename
All tracing::warn! calls in wspub.rs now use target: "payload_builder::broadcast". Consistent with the rest of the broadcast module.

9d98db6 — change log level to warn
Changed a debug or info log to warn level. Appropriate for a failure path.


One observation (non-blocking)

b07040a interaction with stream recovery latency (mod.rs:191–209)

Stream recovery calls open_stream(...).await inside the outgoing_message_rx match arm, for each failed peer × each protocol. While Tokio can yield during these awaits, swarm events (swarm.select_next_some()) won't be processed until the entire arm returns. If multiple peers fail simultaneously and stream recovery is slow (e.g. peer overloaded), there's a window where swarm events accumulate. This was noted in the previous review as non-blocking — it remains non-blocking here, as the scenario requires simultaneous stream failures at a frequency faster than recovery.


Data flow (still correct end-to-end)

| Path | Flow | WS publish | Condition | Correct? |
| --- | --- | --- | --- | --- |
| Leader | builder → handler → p2p_tx → Node::run() → p2p gossip → WS publish | after p2p | only if all peers succeeded | ✅ |
| Follower | remote peer → IncomingStreamsHandler → handler (p2p_rx) | WS publish direct | always | ✅ |

No double-publish between paths. ✅


LGTM. The latest commits correctly tighten the atomicity guarantee (skip WS on any peer failure) and clean up minor refactors. The PR achieves its stated goal cleanly.

@louisliu2048 louisliu2048 merged commit e5f351b into main Mar 27, 2026
3 checks passed